HuQ: An English-Hungarian Corpus for Quality Estimation
Abstract
Quality estimation for machine translation is an important task. Standard automatic evaluation methods that rely on reference translations do not perform this task well enough: for English-Hungarian, they correlate poorly with human evaluation. Quality estimation is a newer approach to this problem. It is a prediction task that estimates translation quality using features extracted only from the source and translated sentences, without reference translations. Quality estimation systems had not previously been implemented for Hungarian, so no training corpus existed either. In this study, we created a dataset for building quality estimation models for English-Hungarian. We also ran experiments to optimize the quality estimation system for Hungarian, focusing on feature engineering and feature selection. The resulting optimized feature sets produced better results than the baseline feature set.
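As a rough illustration of what "features extracted from only the source and translated sentences" can mean, the sketch below computes a few simple reference-free features for one sentence pair. The function name `extract_features` and the four features are hypothetical choices made for this example; they are not the baseline or optimized feature sets used in the study. In a quality estimation setup, vectors like these would be paired with human quality scores to train a prediction model.

```python
import string


def extract_features(source: str, translation: str) -> dict:
    """Compute illustrative features from a source sentence and its translation.

    No reference translation is needed. The features here are assumptions
    for this sketch, not the feature set from the study.
    """
    # Whitespace tokenization keeps the sketch dependency-free.
    src_tokens = source.split()
    tgt_tokens = translation.split()

    # Count ASCII punctuation characters on each side.
    src_punct = sum(ch in string.punctuation for ch in source)
    tgt_punct = sum(ch in string.punctuation for ch in translation)

    return {
        "src_len": float(len(src_tokens)),
        "tgt_len": float(len(tgt_tokens)),
        # Guard against division by zero for an empty source sentence.
        "len_ratio": len(tgt_tokens) / max(len(src_tokens), 1),
        "punct_diff": float(abs(src_punct - tgt_punct)),
    }


features = extract_features("The cat sat on the mat.", "A macska a szőnyegen ült.")
print(features)
```

Feature selection, as mentioned in the abstract, would then amount to choosing which of these dictionary keys to keep when training the model.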
Similar resources
Light Verb Constructions in the SzegedParalellFX English-Hungarian Parallel Corpus
In this paper, we describe the first English–Hungarian parallel corpus annotated for light verb constructions, which contains 14,261 sentence alignment units. Annotation principles and statistical data on the corpus are also provided, and English and Hungarian data are contrasted. On the basis of corpus data, a database containing pairs of English–Hungarian light verb constructions has been cre...
Exploiting Parallel Corpora for Supervised Word-Sense Disambiguation in English-Hungarian Machine Translation
In this paper we present an experiment to automatically generate annotated training corpora for a supervised word sense disambiguation module operating in an English-Hungarian and a Hungarian-English machine translation system. Training examples for the WSD module are produced by annotating ambiguous lexical items in the source language (words having several possible translations) with their pr...
emLam - a Hungarian Language Modeling baseline
This paper aims to make up for the lack of documented baselines for Hungarian language modeling. Various approaches are evaluated on three publicly available Hungarian corpora. Perplexity values comparable to models of similar-sized English corpora are reported. A new, freely downloadable Hungarian benchmark corpus is introduced.
Sentence Alignment of Hungarian-English Parallel Corpora Using a Hybrid Algorithm
We present an efficient hybrid method for aligning sentences with their translations in a parallel bilingual corpus. The new algorithm is composed of a length-based and anchor matching method that uses Named Entity recognition. This algorithm combines the speed of length-based models with the accuracy of anchor finding methods. The accuracy of finding cognates for Hungarian-English language pair is e...
A polyglot domain optimised text-to-speech system for railway station announcements
Announcements at railway stations are a major information source for passengers. In order to ensure high intelligibility, the traditional solution is to use recorded prompts with “slot filling” of variable data. If a data type (e.g. train name) changes new recordings have to be made. Even with careful design the quality of the system will gradually deteriorate due to change of the voice of the ...